hazardous scenario
MATRIX: Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation
Lim, Ernest, He, Yajie Vera, Joselowitz, Jared, Preston, Kate, Chowdhury, Mohita, Williams, Louis, Higham, Aisling, Mason, Katrina, Melo, Mariane, Lawton, Tom, Jia, Yan, Habli, Ibrahim
Despite the growing use of large language models (LLMs) in clinical dialogue systems, existing evaluations focus on task completion or fluency, offering little insight into the behavioral and risk management requirements essential for safety-critical systems. This paper presents MATRIX (Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation), a structured, extensible framework for safety-oriented evaluation of clinical dialogue agents. MATRIX integrates three components: (1) a safety-aligned taxonomy of clinical scenarios, expected system behaviors and failure modes derived through structured safety engineering methods; (2) BehvJudge, an LLM-based evaluator for detecting safety-relevant dialogue failures, validated against expert clinician annotations; and (3) PatBot, a simulated patient agent capable of producing diverse, scenario-conditioned responses, evaluated for realism and behavioral fidelity with human factors expertise, and a patient-preference study. Across three experiments, we show that MATRIX enables systematic, scalable safety evaluation. BehvJudge with Gemini 2.5-Pro achieves expert-level hazard detection (F1 0.96, sensitivity 0.999), outperforming clinicians in a blinded assessment of 240 dialogues. We also conducted one of the first realism analyses of LLM-based patient simulation, showing that PatBot reliably simulates realistic patient behavior in quantitative and qualitative evaluations. Using MATRIX, we demonstrate its effectiveness in benchmarking five LLM agents across 2,100 simulated dialogues spanning 14 hazard scenarios and 10 clinical domains. MATRIX is the first framework to unify structured safety engineering with scalable, validated conversational AI evaluation, enabling regulator-aligned safety auditing. We release all evaluation tools, prompts, structured scenarios, and datasets.
Make Full Use of Testing Information: An Integrated Accelerated Testing and Evaluation Method for Autonomous Driving Systems
Wu, Xinzheng, Chen, Junyi, Wu, Jianfeng, Zhang, Longgao, Xia, Tian, Shen, Yong
Testing and evaluation is an important step before the large-scale application of the autonomous driving systems (ADSs). Based on the three level of scenario abstraction theory, a testing can be performed within a logical scenario, followed by an evaluation stage which is inputted with the testing results of each concrete scenario generated from the logical parameter space. During the above process, abundant testing information is produced which is beneficial for comprehensive and accurate evaluations. To make full use of testing information, this paper proposes an Integrated accelerated Testing and Evaluation Method (ITEM). Based on a Monte Carlo Tree Search (MCTS) paradigm and a dual surrogates testing framework proposed in our previous work, this paper applies the intermediate information (i.e., the tree structure, including the affiliation of each historical sampled point with the subspaces and the parent-child relationship between subspaces) generated during the testing stage into the evaluation stage to achieve accurate hazardous domain identification. Moreover, to better serve this purpose, the UCB calculation method is improved to allow the search algorithm to focus more on the hazardous domain boundaries. Further, a stopping condition is constructed based on the convergence of the search algorithm. Ablation and comparative experiments are then conducted to verify the effectiveness of the improvements and the superiority of the proposed method. The experimental results show that ITEM could well identify the hazardous domains in both low- and high-dimensional cases, regardless of the shape of the hazardous domains, indicating its generality and potential for the safety evaluation of ADSs.
Safety Verification for Evasive Collision Avoidance in Autonomous Vehicles with Enhanced Resolutions
Arab, Aliasghar, Khaleghi, Milad, Partovi, Alireza, Abbaspour, Alireza, Shinde, Chaitanya, Mousavi, Yashar, Azimi, Vahid, Karimmoddini, Ali
This paper presents a comprehensive hazard analysis, risk assessment, and loss evaluation for an Evasive Minimum Risk Maneuvering (EMRM) system designed for autonomous vehicles. The EMRM system is engineered to enhance collision avoidance and mitigate loss severity by drawing inspiration from professional drivers who perform aggressive maneuvers while maintaining stability for effective risk mitigation. Recent advancements in autonomous vehicle technology demonstrate a growing capability for high-performance maneuvers. This paper discusses a comprehensive safety verification process and establishes a clear safety goal to enhance testing validation. The study systematically identifies potential hazards and assesses their risks to overall safety and the protection of vulnerable road users. A novel loss evaluation approach is introduced, focusing on the impact of mitigation maneuvers on loss severity. Additionally, the proposed mitigation integrity level can be used to verify the minimum-risk maneuver feature. This paper applies a verification method to evasive maneuvering, contributing to the development of more reliable active safety features in autonomous driving systems.
Statistical Modelling of Driving Scenarios in Road Traffic using Fleet Data of Production Vehicles
Reichenbächer, Christian, Hipp, Jochen, Bringmann, Oliver
Ensuring the safety of road vehicles at an acceptable level requires the absence of any unreasonable risk arising from all potential hazards linked to the intended au-tomated driving function and its implementation. The assurance that there are no unreasonable risks stemming from hazardous behaviours associated to functional insufficiencies is denoted as safety of intended functionality (SOTIF), a concept outlined in the ISO 21448 standard. In this context, the acquisition of real driving data is considered essential for the verification and validation. For this purpose, we are currently developing a method with which data collect-ed representatively from production vehicles can be modelled into a knowledge-based system in the future. A system that represents the probabilities of occur-rence of concrete driving scenarios over the statistical population of road traffic and makes them usable. The method includes the qualitative and quantitative ab-straction of the drives recorded by the sensors in the vehicles, the possibility of subsequent wireless transmission of the abstracted data from the vehicles and the derivation of the distributions and correlations of scenario parameters. This paper provides a summary of the research project and outlines its central idea. To this end, among other things, the needs for statistical information and da-ta from road traffic are elaborated from ISO 21448, the current state of research is addressed, and methodical aspects are discussed.
Multi-Agent Vulnerability Discovery for Autonomous Driving with Hazard Arbitration Reward
Liu, Weilin, Mu, Ye, Yu, Chao, Ning, Xuefei, Cao, Zhong, Wu, Yi, Liang, Shuang, Yang, Huazhong, Wang, Yu
Discovering hazardous scenarios is crucial in testing and further improving driving policies. However, conducting efficient driving policy testing faces two key challenges. On the one hand, the probability of naturally encountering hazardous scenarios is low when testing a well-trained autonomous driving strategy. Thus, discovering these scenarios by purely real-world road testing is extremely costly. On the other hand, a proper determination of accident responsibility is necessary for this task. Collecting scenarios with wrong-attributed responsibilities will lead to an overly conservative autonomous driving strategy. To be more specific, we aim to discover hazardous scenarios that are autonomous-vehicle responsible (AV-responsible), i.e., the vulnerabilities of the under-test driving policy. To this end, this work proposes a Safety Test framework by finding Av-Responsible Scenarios (STARS) based on multi-agent reinforcement learning. STARS guides other traffic participants to produce Av-Responsible Scenarios and make the under-test driving policy misbehave via introducing Hazard Arbitration Reward (HAR). HAR enables our framework to discover diverse, complex, and AV-responsible hazardous scenarios. Experimental results against four different driving policies in three environments demonstrate that STARS can effectively discover AV-responsible hazardous scenarios. These scenarios indeed correspond to the vulnerabilities of the under-test driving policies, thus are meaningful for their further improvements.